A robust method for partitioning the values of categorical attributes
Abstract
In supervised learning, methods for grouping the values of a categorical attribute build a new synthetic attribute that preserves as much of the informational value of the initial attribute as possible while reducing the number of values. We propose here a generalization of the Khiops discretization algorithm to the value grouping problem. The proposed algorithm makes it possible to control the risk of overfitting a priori and to significantly improve the robustness of the resulting groupings. This robustness was obtained by studying the statistics of the variations of the chi-square criterion when rows of a contingency table are merged, and by modeling the statistical behavior of the Khiops algorithm. Extensive experiments validated this approach and showed that the Khiops grouping method produces effective groupings, in terms of both predictive quality and a small number of groups.
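To make the idea of merging contingency-table rows under the chi-square criterion concrete, here is a minimal Python sketch. It is not the Khiops algorithm: the abstract does not give its stopping rule, so this sketch assumes a fixed target number of groups (`n_groups`) and simply merges, at each step, the pair of rows whose merge keeps the chi-square statistic highest. The function names `chi2`, `merge_rows`, and `greedy_grouping` are hypothetical, as is the toy data.

```python
from itertools import combinations


def chi2(table):
    """Pearson chi-square statistic of a contingency table (list of rows)."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for row, rt in zip(table, row_tot):
        for obs, ct in zip(row, col_tot):
            exp = rt * ct / n  # expected count under independence
            if exp > 0:
                stat += (obs - exp) ** 2 / exp
    return stat


def merge_rows(table, groups, i, j):
    """Return a new table and group list with rows i and j merged."""
    merged = [a + b for a, b in zip(table[i], table[j])]
    new_table = [r for k, r in enumerate(table) if k not in (i, j)]
    new_groups = [g for k, g in enumerate(groups) if k not in (i, j)]
    return new_table + [merged], new_groups + [groups[i] + groups[j]]


def greedy_grouping(counts, n_groups):
    """Greedy bottom-up grouping of attribute values.

    counts: dict mapping each attribute value to its list of class counts.
    n_groups: assumed fixed target (Khiops instead uses a statistical
              stopping rule, which is not reproduced here).
    """
    groups = [[v] for v in counts]
    table = [list(counts[v]) for v in counts]
    while len(table) > n_groups:
        # Pick the pair whose merge preserves the most chi-square.
        best = max(
            combinations(range(len(table)), 2),
            key=lambda p: chi2(merge_rows(table, groups, *p)[0]),
        )
        table, groups = merge_rows(table, groups, *best)
    return groups, table, chi2(table)


# Toy data: values "a"/"b" and "c"/"d" have similar class distributions.
counts = {"a": [30, 10], "b": [28, 12], "c": [5, 35], "d": [8, 32]}
groups, table, stat = greedy_grouping(counts, 2)
print(groups, round(stat, 2))
```

On the toy data, values with similar class distributions end up in the same group. A fixed `n_groups` gives no protection against overfitting, which is exactly the gap the method described in the abstract aims to close.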
Similar resources
A Comparative Study between a Pseudo-Forward Equation (PFE) and Intelligence Methods for the Characterization of the North Sea Reservoir
This paper presents a comparative study between three versions of adaptive neuro-fuzzy inference system (ANFIS) algorithms and a pseudo-forward equation (PFE) to characterize the North Sea reservoir (F3 block) based on seismic data. According to the statistical studies, four attributes (energy, envelope, spectral decomposition and similarity) are known to be useful as fundamental attributes in ...
A Divisive Ordering Algorithm for Mapping Categorical Data to Numeric Data
The amount of computing time for K Nearest Neighbor Search is linear to the size of the dataset if the dataset is not indexed. This is not endurable for on-line applications with time constraints when the dataset is large. However, if there are categorical attributes in the dataset, an index cannot be built on the dataset. One possible solution to index such datasets is to convert categorical a...
A Bayes Optimal Approach for Partitioning the Values of Categorical Attributes
In supervised machine learning, the partitioning of the values (also called grouping) of a categorical attribute aims at constructing a new synthetic attribute which keeps the information of the initial attribute and reduces the number of its values. In this paper, we propose a new grouping method MODL founded on a Bayesian approach. The method relies on a model space of grouping models and on ...
Interval MULTIMOORA method with target values of attributes based on interval distance and preference degree: biomaterials selection
A target-based MADM method covers beneficial and non-beneficial attributes besides target values for some attributes. Such techniques are considered as the comprehensive forms of MADM approaches. Target-based MADM methods can also be used in traditional decision-making problems in which beneficial and non-beneficial attributes only exist. In many practical selection problems, some attributes ha...
Context-Based Distance Learning for Categorical Data Clustering
Clustering data described by categorical attributes is a challenging task in data mining applications. Unlike numerical attributes, it is difficult to define a distance between pairs of values of the same categorical attribute, since they are not ordered. In this paper, we propose a method to learn a context-based distance for categorical attributes. The key intuition of this work is that the d...
A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining
Partitioning a large set of objects into homogeneous clusters is a fundamental operation in data mining. The k-means algorithm is best suited for implementing this operation because of its efficiency in clustering large data sets. However, working only on numeric values limits its use in data mining because data sets in data mining often contain categorical values. In this paper we present an a...